
Feat/multi git sync#729

Open
Maleware wants to merge 33 commits into main from feat/multi-git-sync

Conversation


@Maleware Maleware commented Jan 4, 2026

Description

Part of #721

This PR adds support for multiple instances of git-sync.

This is done via a symlink to the current folder of each git-sync instance.

With the kubernetesExecutor, a `cp -r` replaces the symlink, since the executor pod is a one-off anyway and it is easier to transfer contents via volumeMounts and an init container.
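A rough sketch of the two strategies (the paths and layout here are illustrative, not the operator's actual code):

```python
import os
import shutil
import tempfile

# Illustrative layout only: each git-sync instance syncs into its own
# directory, and a "current-{i}" pointer references the latest checkout.
root = tempfile.mkdtemp()
for i in range(2):
    checkout = os.path.join(root, f"git-sync-{i}", "rev-abc123")
    os.makedirs(checkout)
    open(os.path.join(checkout, "dag.py"), "w").close()
    # celery-style: a symlink per instance in the shared DAGs folder
    os.symlink(checkout, os.path.join(root, f"current-{i}"))

# kubernetes-executor-style: the init container copies instead of linking,
# resolving symlinks so plain files are handed over (like `cp -r`)
dest = os.path.join(root, "copied-current-0")
shutil.copytree(os.path.join(root, "current-0"), dest)
print(os.path.islink(dest))  # the copy is a real directory, not a link
```

The copy approach trades disk space for paths that resolve cleanly across volumeMounts, which is why the one-off executor pods use it.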

The integration test 'mount-dags-gitsync' has been changed to use two git-sync resources pointing at different branches of the same repository.

Definition of Done Checklist

  • Not all of these items are applicable to all PRs; the author should update this template to leave only the relevant boxes
  • Please make sure all these things are done and tick the boxes

Author

  • Changes are OpenShift compatible
  • CRD changes approved
  • CRD documentation for all fields, following the style guide.
  • Helm chart can be installed and the deployed operator works
  • Integration tests passed (for non-trivial changes)
  • Changes need to be "offline" compatible
  • Links to generated (nightly) docs added
  • Release note snippet added

Reviewer

  • Code contains useful comments
  • Code contains useful logging statements
  • (Integration-)Test cases added
  • Documentation added or updated. Follows the style guide.
  • Changelog updated
  • Cargo.toml only contains references to git tags (not specific commits or branches)

Acceptance

  • Feature Tracker has been updated
  • Proper release label has been added
  • Links to generated (nightly) docs added
  • Release note snippet added
  • Add type/deprecation label & add to the deprecation schedule
  • Add type/experimental label & add to the experimental features tracker

@Maleware
Member Author

All tests passed locally apart from

--- FAIL: kuttl/harness/logging_airflow-3.1.6_openshift-false_executor-kubernetes (801.68s)
--- FAIL: kuttl/harness/logging_airflow-3.1.6_openshift-false_executor-celery (1022.60s)

which is a known issue and will be tackled separately.


Maleware commented Feb 27, 2026

RELEASE-NOTE-SNIPPET:

Airflow now supports multiple git-sync instances and thus multiple git repositories (or multiple branches of one repository). Note that when using git-sync, the default path has changed to `/stackable/app/allDAGs/current-{i}`.
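For illustration, the resulting default DAG folders for two configured git-sync entries would look like this (a sketch; the entry names are hypothetical, and the index `i` follows the order of the git-sync list):

```python
# Sketch of the new default mount-path scheme for multiple git-sync
# instances; the index i follows the order of the configured entries.
DAGS_ROOT = "/stackable/app/allDAGs"
git_syncs = ["repo@main", "repo@feature"]  # hypothetical git-sync entries
paths = [f"{DAGS_ROOT}/current-{i}" for i in range(len(git_syncs))]
print(paths)
# ['/stackable/app/allDAGs/current-0', '/stackable/app/allDAGs/current-1']
```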

@Maleware Maleware self-assigned this Feb 27, 2026
@adwk67 adwk67 moved this to Development: In Review in Stackable Engineering Mar 3, 2026

@adwk67 adwk67 left a comment


Just a couple of initial comments after running the tests.

# will expect 2 (two containers, base and gitsync)
- script: kubectl -n $NAMESPACE get cm airflow-executor-pod-template -o json | jq -r '.data."airflow_executor_pod_template.yaml"' | grep "AIRFLOW_TEST_VAR" | wc -l | grep 2
# will expect 1 (one container, gitsync)
# will expect 6 (2 from the volume declaration + mounts to four containers, base and gitsync-0, gitsync-1 and multi-gitsync)
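The `grep | wc -l` pipeline above counts the number of *lines* containing the marker string in the rendered pod template. A Python equivalent of that check (the ConfigMap content below is made up for illustration):

```python
import json

# Sketch of what the kuttl assertion checks: count lines mentioning a
# marker string in the rendered pod template. Sample JSON is fabricated.
cm_json = json.dumps({"data": {"airflow_executor_pod_template.yaml":
    "env:\n- name: AIRFLOW_TEST_VAR\n  value: base\n"
    "env:\n- name: AIRFLOW_TEST_VAR\n  value: gitsync\n"}})
template = json.loads(cm_json)["data"]["airflow_executor_pod_template.yaml"]
# mirrors `grep "AIRFLOW_TEST_VAR" | wc -l`, which counts matching lines
count = sum("AIRFLOW_TEST_VAR" in line for line in template.splitlines())
print(count)  # 2, one per container that sets the variable
```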

I only get 5 here: 3 volume mounts (base, git-sync-0, git-sync-1) and 2 from volume declaration. Likewise, probably, with the others.

# activate DAG
response = requests.patch(
    f"{rest_url}/dags/{dag_id}", headers=headers, json={"is_paused": False}
)
response_0 = requests.patch(

This step fails for me as the DAGs are imported with errors:

Traceback (most recent call last):
  File "<frozen importlib._bootstrap>", line 488, in _call_with_frames_removed
  File "/stackable/app/allDAGs/current-0/mount-dags-gitsync/dags_airflow3/pyspark_pi.py", line 41, in <module>
    with open(yaml_path, "r") as file:
         ^^^^^^^^^^^^^^^^^^^^
FileNotFoundError: [Errno 2] No such file or directory: '/stackable/app/allDAGs/current-0/dags_airflow3/pyspark_pi.yaml'

Although this path does exist.
Tested with this:

[stackable@airflow-webserver-default-0 ~]$ airflow dags list-import-errors
/stackable/app/lib64/python3.12/site-packages/airflow/providers/common/compat/security/permissions.py:30 RemovedInAirflow4Warning: The airflow.security.permissions module is deprecated; please see https://airflow.apache.org/docs/apache-airflow/stable/security/deprecated_permissions.html
bundle_name | filepath                                                 | error                                                                                                                   
============+==========================================================+=========================================================================================================================
dags-folder | current-1/mount-dags-gitsync/dags_airflow3/pyspark_pi.py | Traceback (most recent call last):                                                                                      
            |                                                          |   File "<frozen importlib._bootstrap>", line 488, in _call_with_frames_removed                                          
            |                                                          |   File "/stackable/app/allDAGs/current-1/mount-dags-gitsync/dags_airflow3/pyspark_pi.py", line 41, in <module>          
            |                                                          |     with open(yaml_path, "r") as file:                                                                                  
            |                                                          |          ^^^^^^^^^^^^^^^^^^^^                                                                                           
            |                                                          | FileNotFoundError: [Errno 2] No such file or directory: '/stackable/app/allDAGs/current-1/dags_airflow3/pyspark_pi.yaml'
            |                                                          |                                                                                                                         
dags-folder | current-0/mount-dags-gitsync/dags_airflow3/pyspark_pi.py | Traceback (most recent call last):                                                                                      
            |                                                          |   File "<frozen importlib._bootstrap>", line 488, in _call_with_frames_removed                                          
            |                                                          |   File "/stackable/app/allDAGs/current-0/mount-dags-gitsync/dags_airflow3/pyspark_pi.py", line 41, in <module>          
            |                                                          |     with open(yaml_path, "r") as file:                                                                                  
            |                                                          |          ^^^^^^^^^^^^^^^^^^^^                                                                                           
            |                                                          | FileNotFoundError: [Errno 2] No such file or directory: '/stackable/app/allDAGs/current-0/dags_airflow3/pyspark_pi.yaml'
            |                                                          |                                                                                                                         
                                                                                                                                                                                                 
[stackable@airflow-webserver-default-0 ~]$ 

@adwk67 adwk67 Mar 3, 2026


I notice you have a different folder structure in your repo. In the one used for the tests we have:
gitFolder: "mount-dags-gitsync/dags_airflow3" and the dags in the pods are here:
/stackable/app/allDAGs/current-0/mount-dags-gitsync/dags_airflow3/ which doesn't match /stackable/app/allDAGs/current-0/dags_airflow3/pyspark_pi.yaml


And I think this is brittle:

    yaml_path = os.path.join(
        os.environ.get("AIRFLOW__CORE__DAGS_FOLDER"), "current-1", "dags_airflow3" ,"pyspark_pi.yaml"
    )

as the DAG itself has to know about the nesting.


@Maleware Maleware Mar 3, 2026


Ah yeah, that slipped through. We'd need to teach the DAG in the repository the whole path, I guess.

I understand. We could introduce new environment variables, set via the operator, like GIT_0 and GIT_1, pointing to the respective repositories. Then we could instead just write

    yaml_path = os.path.join(
        os.environ.get("GIT_1") ,"pyspark_pi.yaml"
    )

The user would set the path-to-folder ("mount-dags-gitsync/dags_airflow3" in this case) and we would prefix it with the AIRFLOW_DAG_FOLDER path.

For integration tests I think it would maybe be fine to reference like it is currently?
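The proposed environment-variable approach might look like this inside a DAG (GIT_1 and its value are hypothetical here, not an existing operator feature):

```python
import os

# Hypothetical only: suppose the operator exported one variable per
# git-sync entry, e.g. GIT_1 pointing at that instance's checkout
# (user-supplied folder prefixed with the DAGs-folder path).
os.environ.setdefault(
    "GIT_1",
    "/stackable/app/allDAGs/current-1/mount-dags-gitsync/dags_airflow3",
)

# The DAG then resolves files relative to its own repository only and no
# longer needs to know about the nesting inside the mount:
yaml_path = os.path.join(os.environ["GIT_1"], "pyspark_pi.yaml")
print(yaml_path)
```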


I'm trying with this which seems to work:

    # yaml_path = os.path.join(
    #     os.environ.get("AIRFLOW__CORE__DAGS_FOLDER"), "current-0", "dags_airflow3", "pyspark_pi.yaml"
    # )
    dags_folder = Path(os.environ.get("AIRFLOW__CORE__DAGS_FOLDER", "/stackable/app/allDAGs")) / "current-0"
    matches = list(dags_folder.rglob("pyspark_pi.yaml"))
    if not matches:
        raise FileNotFoundError(f"pyspark_pi.yaml not found under {dags_folder}")
    if len(matches) > 1:
        raise RuntimeError(f"Multiple pyspark_pi.yaml files found under {dags_folder}: {matches}")
    yaml_path = matches[0]


If you want to access multiple branches of a repository or load dags from multiple repositories, you can extend the list shown above.

The default mount-path for git-sync resources is `/stackable/app/allDAGs/current-{i}` where i corresponds to:
I don't think the user needs to know this as it is (or should be?) an internal detail.

sshPrivateKeySecretName: git-sync-ssh
----

NOTE: Using DAG's from ConfigMaps will need you to either use celeryExecutors and mount them under `/stackable/app/allDAGs/<name>` or use kubernetesExecutor and mount them somewhere else and change the PYTHONPATH as well as the `AIRFLOW\__CORE__DAGS_FOLDER` accordingly.

Suggested change
NOTE: Using DAG's from ConfigMaps will need you to either use celeryExecutors and mount them under `/stackable/app/allDAGs/<name>` or use kubernetesExecutor and mount them somewhere else and change the PYTHONPATH as well as the `AIRFLOW\__CORE__DAGS_FOLDER` accordingly.
NOTE: Using DAGs from ConfigMaps will require you to either use celeryExecutors and mount them under `/stackable/app/allDAGs/<name>` or use the kubernetesExecutor and mount them somewhere else, changing the PYTHONPATH (if using submodules) and `AIRFLOW\__CORE__DAGS_FOLDER` accordingly.


@adwk67 adwk67 left a comment


Is this a breaking change? Will existing set-ups (DAGs mounted via ConfigMaps, single git-sync repos) still work? The versioning test (also using git-sync) does not pass for me (probably because it is still referencing the DAG in the main branch).

symlinks
}

// kubernetesExecuter needs copy since it needs init-container.

Can we give a reason for this (I think it was something to do with the symlink not resulting in "clean" paths, or something...)

container_debug_command(),
"airflow triggerer &".to_string(),
]),
// This essentially doesn't work for KubernetesExecutor

Can we give a reason?

format!(
"cp -RL {CONFIG_PATH}/{AIRFLOW_CONFIG_FILENAME} {AIRFLOW_HOME}/{AIRFLOW_CONFIG_FILENAME}"
),
// Adding cm as dags within the same AIRFLOW_DAGS_FOLDER may lead to problems, thus checking if exists.

Does this mean that we can't simply have, say, 2 DAGs in a single ConfigMap mounted to AIRFLOW_DAGS_FOLDER, or something else?
